Fault Tolerant Computing on the Grid: What are My Options?

Author

  • Jon B. Weissman
Abstract

High-performance distributed computing across wide-area networks has become an active topic of research [1][3][4][11]. Metasystem and grid software infrastructure projects, most notably Legion [4] and Globus [3], have emerged to support this new computational paradigm. Achieving large-scale distributed computing in a seamless manner introduces a number of difficult problems. This paper examines one of the most critical of these problems, fault tolerance. A large wide-area system that contains hundreds to thousands of machines and multiple networks has a short mean time to failure. The most common failure modes are machine faults, in which hosts go down and are rebooted, and network faults, in which links go down. A single monolithic solution for fault tolerance that is acceptable to all user applications is unlikely. For example, some applications may require continuous availability, others protection from Byzantine failures, and still others lightweight, low-overhead fault tolerance. The most appropriate method for fault tolerance may therefore be application-specific. This follows the current trend in distributed systems and operating systems in which generic functions once performed within the “system” are now being moved to user space for increased flexibility and performance. Because general-purpose systems often impose a high cost on applications that do not fit their assumptions, the maxim “pay for what you need” has been proposed as a guiding principle for application-centric policy decisions in metacomputing systems such as Legion. We believe that the relative performance of fault tolerance methods is a key piece of information that users need in order to make these decisions on behalf of their applications. This is particularly true for high-performance applications. To this end, we have examined fault tolerance options for a common class of high-performance parallel applications, single-program-multiple-data (SPMD).
Performance models for two fault tolerance methods, checkpoint-recovery and wide-area replication, have been developed. These models enable quantitative comparisons of the two methods as applied to SPMD applications. While these


Similar resources

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique for realizing fault tolerance is periodic checkpointing that periodically ...


Fault Tolerant DNA Computing Based on Digital Microfluidic Biochips

Historically, DNA molecules have been known as the building blocks of life. Later, in 1994, Leonard Adleman introduced a technique for using DNA molecules for a new kind of computation. Owing to its massive parallelism, huge storage capacity, and the ability to use DNA molecules inside living tissue, this type of computation is applied in many application areas such as me...


An approach to fault detection and correction in design of systems using of Turbo codes

We present an approach to the design of fault-tolerant computing systems. In this paper, a technique is employed that enables the combination of several codes in order to obtain flexibility in the design of error-correcting codes. Code-combining techniques are very effective, and turbo codes are one such code. The algorithm-based fault tolerance techniques that detect errors rely on the c...


Fault Tolerant Reversible QCA Design using TMR and Fault Detecting by a Comparator Circuit

Quantum-dot Cellular Automata (QCA) is an emerging and promising technology that provides significant improvements over CMOS. Recently, QCA has been advocated as a candidate for implementing reversible circuits. However, QCA, like other nanotechnologies, suffers from a high fault rate. The main purpose of this paper is to develop a fault-tolerant model of QCA circuits by redundancy in hardware a...


Fault Tolerant Grid Migration Using Network Storage

We present a protocol for a fault-tolerant grid migration system using network storage. The proposed system can recover from faults autonomically. We also implemented a grid migration system using Globus Toolkit 4 and network storage. A performance evaluation is reported to show the effectiveness of the fault recovery.




Publication date: 1999